{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2f09c6e2-d1d3-4fc3-9be4-3141425a2f50",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-01T08:57:57.470902Z",
     "iopub.status.busy": "2024-10-01T08:57:57.470238Z",
     "iopub.status.idle": "2024-10-01T08:57:57.489907Z",
     "shell.execute_reply": "2024-10-01T08:57:57.488165Z",
     "shell.execute_reply.started": "2024-10-01T08:57:57.470851Z"
    }
   },
   "outputs": [],
   "source": [
    "from IPython.display import Image\n",
    "import os\n",
    "os.environ['http_proxy'] = 'http://127.0.0.1:7890'\n",
    "os.environ['https_proxy'] = 'http://127.0.0.1:7890'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae59abd8-9f41-483e-a0e6-09732507c297",
   "metadata": {},
   "source": [
    "- references\n",
    "    - https://tigress-web.princeton.edu/~jdh4/PyTorchPerformanceTuningGuide_GTC2021.pdf\n",
    "        - https://pytorch.org/blog/optimizing-cuda-rnn-with-torchscript/\n",
    "    - https://towardsdatascience.com/optimize-pytorch-performance-for-speed-and-memory-efficiency-2022-84f453916ea6\n",
    "    - https://blog.dailydoseofds.com/p/memory-pinning-to-accelerate-model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bc0214af-6775-4bd7-94b2-7e549808cd0a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-01T08:57:58.529049Z",
     "iopub.status.busy": "2024-10-01T08:57:58.528510Z",
     "iopub.status.idle": "2024-10-01T08:57:59.794500Z",
     "shell.execute_reply": "2024-10-01T08:57:59.793676Z",
     "shell.execute_reply.started": "2024-10-01T08:57:58.529007Z"
    }
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.jit\n",
    "import timeit"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0e95e98-0359-4c53-ad57-a720a17a1d96",
   "metadata": {},
   "source": [
    "### pin_memory & non_blocking (num_workers)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a591e76b-a95b-41e7-bf45-cd2b83a0b4dd",
   "metadata": {},
   "source": [
    "```\n",
    "train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)\n",
    "for epoch in range(epochs): \n",
    "    for batch_idx, (data, target) in enumerate(train_loader):\n",
    "        \n",
    "        data, target = data.to(device), target.to(device)\n",
    "        \n",
    "        optimizer.zero_grad()\n",
    "        output = model(data)\n",
    "        loss = criterion(output, target)\n",
    "        loss.backward()        \n",
    "        optimizer.step()\n",
    "```\n",
    "\n",
    "standard PyTorch model training loop\n",
    "- `data, target = data.to(device), target.to(device)` transfers the data to the GPU from the CPU.\n",
    "- Everything executes on the GPU **after** the data transfer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "e9902cea-395b-4cc9-a95a-f2477e0dd9ec",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-28T03:42:47.578751Z",
     "iopub.status.busy": "2024-09-28T03:42:47.578118Z",
     "iopub.status.idle": "2024-09-28T03:42:47.591618Z",
     "shell.execute_reply": "2024-09-28T03:42:47.589501Z",
     "shell.execute_reply.started": "2024-09-28T03:42:47.578702Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5956919f-affc-4206-8296-03ff549476db_2160x780.png\" width=\"500\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# synchronization\n",
    "Image(url='https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5956919f-affc-4206-8296-03ff549476db_2160x780.png', width=500)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5466b43f-19f9-4902-814e-2cfab642ef0f",
   "metadata": {},
   "source": [
    "- When the model is being trained on the 1st mini-batch, the CPU can transfer the 2nd mini-batch to the GPU.\n",
    "- This ensures that the GPU does not have to wait for the next mini-batch of data as soon as it completes processing an existing mini-batch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "5577090c-b7d0-4b92-871e-6adec995137a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-28T03:43:53.704141Z",
     "iopub.status.busy": "2024-09-28T03:43:53.703529Z",
     "iopub.status.idle": "2024-09-28T03:43:53.716647Z",
     "shell.execute_reply": "2024-09-28T03:43:53.714743Z",
     "shell.execute_reply.started": "2024-09-28T03:43:53.704094Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb51e307-211f-479e-a677-b6b8ad69fcfd_2370x780.png\" width=\"500\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Image(url='https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb51e307-211f-479e-a677-b6b8ad69fcfd_2370x780.png', width=500)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03352596-3e9d-49d1-8231-813f88fa3cf0",
   "metadata": {},
   "source": [
    "- While the CPU may remain idle, this process ensures that the GPU (which is the actual accelerator for our model training) always has data to work with.\n",
    "    - 尽可能地避免 gpu 的 idle\n",
    "- Formally, this process is known as **memory pinning**, and it is used to speed up the data transfer from the CPU to the GPU by making the training workflow asynchronous.\n",
    "    - 内存的 swapping（交换）是指操作系统将物理内存中的数据临时存储到硬盘上的交换空间（**Swap Space**,，位于硬盘）或交换文件（Swap File）中的过程。当系统的物理内存（RAM）不足以容纳正在运行的所有程序和数据时，操作系统会将一部分当前不活跃的数据从内存中移出，写入交换空间，从而腾出物理内存供其他程序使用。\n",
    "    - pin_memory 指的是将内存“锁页”（pin memory，vs. 可分页（pageable）虚拟内存），即将内存页固定在物理内存中，防止操作系统将其交换到磁盘（swapping）。这种锁页内存也被称为“页面锁定内存”或“固定内存”。\n",
    "        - pinned memory，又称page-locked memory\n",
    "        - unpinned memory，又称pageable memory\n",
    "    - pinned memory is used as a staging area for transfers from the device to the host. We can avoid the cost of the transfer between pageable and pinned host arrays by directly allocating our host arrays in pinned memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "e82cca4b-d5b8-405e-85d3-f234c85f2f1c",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-01T09:06:47.823752Z",
     "iopub.status.busy": "2024-10-01T09:06:47.823114Z",
     "iopub.status.idle": "2024-10-01T09:06:47.835567Z",
     "shell.execute_reply": "2024-10-01T09:06:47.833481Z",
     "shell.execute_reply.started": "2024-10-01T09:06:47.823706Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://developer-blogs.nvidia.com/wp-content/uploads/2012/12/pinned-1024x541.jpg\" width=\"500\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/\n",
    "Image(url='https://developer-blogs.nvidia.com/wp-content/uploads/2012/12/pinned-1024x541.jpg', width=500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e11262e3-e5e5-4032-a164-2871c098f92a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-01T08:59:18.561060Z",
     "iopub.status.busy": "2024-10-01T08:59:18.560745Z",
     "iopub.status.idle": "2024-10-01T08:59:18.571381Z",
     "shell.execute_reply": "2024-10-01T08:59:18.569311Z",
     "shell.execute_reply.started": "2024-10-01T08:59:18.561038Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(tensor([ 0.8242,  0.5101,  1.0573, -0.3424,  1.5565]), False)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = torch.randn(5)\n",
    "X, X.is_pinned()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "5d39e8cf-0653-4445-a097-11778687f3ba",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-01T08:59:19.677963Z",
     "iopub.status.busy": "2024-10-01T08:59:19.677378Z",
     "iopub.status.idle": "2024-10-01T08:59:19.689556Z",
     "shell.execute_reply": "2024-10-01T08:59:19.687309Z",
     "shell.execute_reply.started": "2024-10-01T08:59:19.677919Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = X.pin_memory()\n",
    "X.is_pinned()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "1d5b1fdf-a2b6-4ef1-ac60-894a9013ff88",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-01T09:00:01.315117Z",
     "iopub.status.busy": "2024-10-01T09:00:01.314694Z",
     "iopub.status.idle": "2024-10-01T09:00:01.534909Z",
     "shell.execute_reply": "2024-10-01T09:00:01.533019Z",
     "shell.execute_reply.started": "2024-10-01T09:00:01.315089Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([-1.5154, -1.1938, -0.9721,  0.0941, -0.5044], device='cuda:0') False\n"
     ]
    },
    {
     "ename": "RuntimeError",
     "evalue": "cannot pin 'torch.cuda.FloatTensor' only dense CPU tensors can be pinned",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mRuntimeError\u001b[0m                              Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[12], line 3\u001b[0m\n\u001b[1;32m      1\u001b[0m X \u001b[38;5;241m=\u001b[39m torch\u001b[38;5;241m.\u001b[39mrandn(\u001b[38;5;241m5\u001b[39m, device\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcuda\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m      2\u001b[0m \u001b[38;5;28mprint\u001b[39m(X, X\u001b[38;5;241m.\u001b[39mis_pinned())\n\u001b[0;32m----> 3\u001b[0m X \u001b[38;5;241m=\u001b[39m \u001b[43mX\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mpin_memory\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m      4\u001b[0m \u001b[38;5;28mprint\u001b[39m(X, X\u001b[38;5;241m.\u001b[39mis_pinned())\n",
      "\u001b[0;31mRuntimeError\u001b[0m: cannot pin 'torch.cuda.FloatTensor' only dense CPU tensors can be pinned"
     ]
    }
   ],
   "source": [
    "X = torch.randn(5, device='cuda')\n",
    "print(X, X.is_pinned())\n",
    "X = X.pin_memory()\n",
    "print(X, X.is_pinned())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a3b4f73-cd4c-4392-8eeb-4af44c5d7745",
   "metadata": {},
   "source": [
    "- enable `pin_memory` and set num_workers (muti-core processors) for faster transfers\n",
    "```\n",
    "train_loader = DataLoader(train_dataset,\n",
    "                          batch_size=64, shuffle=True,\n",
    "                          pin_memory=True, num_workers=8)\n",
    "```\n",
    "\n",
    "- during the data transfer step in the training step, specify `non_blocking=True`, as depicted below:\n",
    "    - non_blocking => gpu training on prev minibatch 时执行 cpu 上的 transfer\n",
    "    - When the model is being trained on the 1st mini-batch, the CPU can transfer the 2nd mini-batch to the GPU.\n",
    "    - This ensures that the GPU does not have to wait for the next mini-batch of data as soon as it completes processing an existing mini-batch.\n",
    "```\n",
    "for epoch in range(epochs):\n",
    "    for batch_idx, (data, target) in enumerate(train_loader):\n",
    "        \n",
    "        data = data.to(device, non_blocking=True)\n",
    "        target = target.to(device, non_blocking=True)\n",
    "        \n",
    "        optimizer.zero_grad()        \n",
    "        output = model(data)\n",
    "        loss = criterion(output, target)\n",
    "        loss.backward()\n",
    "        optimizer.step()\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "03dc5452-11d5-4e40-9a6a-34d38cc9f86c",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-28T04:03:53.294992Z",
     "iopub.status.busy": "2024-09-28T04:03:53.294606Z",
     "iopub.status.idle": "2024-09-28T04:03:53.416247Z",
     "shell.execute_reply": "2024-09-28T04:03:53.414335Z",
     "shell.execute_reply.started": "2024-09-28T04:03:53.294949Z"
    }
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim\n",
    "from torch.utils.data import DataLoader\n",
    "from torchvision import datasets, transforms\n",
    "import time\n",
    "from tqdm.notebook import tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "62a09058-4311-43ab-b20a-5401250b5adb",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-28T03:54:05.509544Z",
     "iopub.status.busy": "2024-09-28T03:54:05.509123Z",
     "iopub.status.idle": "2024-09-28T03:54:05.517944Z",
     "shell.execute_reply": "2024-09-28T03:54:05.515944Z",
     "shell.execute_reply.started": "2024-09-28T03:54:05.509521Z"
    }
   },
   "outputs": [],
   "source": [
    "# 定义数据转换\n",
    "transform = transforms.Compose([\n",
    "    transforms.ToTensor(),\n",
    "    transforms.Normalize((0.5,), (0.5,))\n",
    "])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "f25239c6-820b-473e-96e7-bca3cd340400",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-28T03:55:26.604882Z",
     "iopub.status.busy": "2024-09-28T03:55:26.604408Z",
     "iopub.status.idle": "2024-09-28T03:55:26.714990Z",
     "shell.execute_reply": "2024-09-28T03:55:26.714143Z",
     "shell.execute_reply.started": "2024-09-28T03:55:26.604847Z"
    }
   },
   "outputs": [],
   "source": [
    "# 加载 MNIST 数据集\n",
    "train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "cf904b09-0d6b-43f0-aae9-6a13a057ecc2",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-28T03:56:59.893949Z",
     "iopub.status.busy": "2024-09-28T03:56:59.893569Z",
     "iopub.status.idle": "2024-09-28T03:56:59.927921Z",
     "shell.execute_reply": "2024-09-28T03:56:59.925551Z",
     "shell.execute_reply.started": "2024-09-28T03:56:59.893922Z"
    }
   },
   "outputs": [],
   "source": [
    "class SimpleNet(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(SimpleNet, self).__init__()\n",
    "        self.fc = nn.Sequential(\n",
    "            nn.Flatten(),\n",
    "            nn.Linear(28*28, 512),\n",
    "            nn.ReLU(),\n",
    "            nn.Linear(512, 10)\n",
    "        )\n",
    "    def forward(self, x):\n",
    "        return self.fc(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "05414939-c08b-47e4-bad4-5ca518f834b8",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-28T03:57:04.511457Z",
     "iopub.status.busy": "2024-09-28T03:57:04.510821Z",
     "iopub.status.idle": "2024-09-28T03:57:04.571665Z",
     "shell.execute_reply": "2024-09-28T03:57:04.570082Z",
     "shell.execute_reply.started": "2024-09-28T03:57:04.511410Z"
    }
   },
   "outputs": [],
   "source": [
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "43711554-7122-429a-ac06-aaca5bba2d76",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-28T04:03:57.047013Z",
     "iopub.status.busy": "2024-09-28T04:03:57.046648Z",
     "iopub.status.idle": "2024-09-28T04:03:57.058402Z",
     "shell.execute_reply": "2024-09-28T04:03:57.055990Z",
     "shell.execute_reply.started": "2024-09-28T04:03:57.046988Z"
    }
   },
   "outputs": [],
   "source": [
    "def train(loader, model, optimizer, device, non_blocking=False, epochs=5):\n",
    "    model.train()\n",
    "    total_loss = 0\n",
    "    start_time = time.time()\n",
    "    for epoch in tqdm(range(epochs), desc=\"Epochs\"):\n",
    "        batch_bar = tqdm(enumerate(loader), total=len(loader), desc=f\"Epoch {epoch+1}/{epochs}\", leave=False)\n",
    "        for batch_idx, (data, target) in batch_bar:\n",
    "            # 将数据移到设备上\n",
    "            data = data.to(device, non_blocking=non_blocking)\n",
    "            target = target.to(device, non_blocking=non_blocking)\n",
    "            \n",
    "            optimizer.zero_grad()\n",
    "            output = model(data)\n",
    "            loss = nn.functional.cross_entropy(output, target)\n",
    "            loss.backward()\n",
    "            optimizer.step()\n",
    "            \n",
    "            total_loss += loss.item()\n",
    "    end_time = time.time()\n",
    "    avg_loss = total_loss / (len(loader)*epochs)\n",
    "    elapsed_time = end_time - start_time\n",
    "    return avg_loss, elapsed_time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "b34821d9-477d-4f9f-a66e-ac1b115dcf3a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-28T04:03:59.595257Z",
     "iopub.status.busy": "2024-09-28T04:03:59.594608Z",
     "iopub.status.idle": "2024-09-28T04:03:59.605255Z",
     "shell.execute_reply": "2024-09-28T04:03:59.603010Z",
     "shell.execute_reply.started": "2024-09-28T04:03:59.595189Z"
    }
   },
   "outputs": [],
   "source": [
    "from multiprocessing import cpu_count\n",
    "loader1 = DataLoader(train_dataset, batch_size=64, shuffle=True)\n",
    "loader2 = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=cpu_count()//2, pin_memory=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "bc459cc9-13c3-4b5a-a0ff-d54a94d10eb7",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-28T04:04:00.672015Z",
     "iopub.status.busy": "2024-09-28T04:04:00.671382Z",
     "iopub.status.idle": "2024-09-28T04:04:43.410715Z",
     "shell.execute_reply": "2024-09-28T04:04:43.409719Z",
     "shell.execute_reply.started": "2024-09-28T04:04:00.671968Z"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6168ba4dd6f44b428ab72433c75c5bf2",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epochs:   0%|          | 0/5 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch 1/5:   0%|          | 0/938 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch 2/5:   0%|          | 0/938 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch 3/5:   0%|          | 0/938 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch 4/5:   0%|          | 0/938 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch 5/5:   0%|          | 0/938 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "设置 1 - 平均损失: 0.3854, 训练时间: 42.71 秒\n"
     ]
    }
   ],
   "source": [
    "model1 = SimpleNet().to(device)\n",
    "optimizer1 = optim.SGD(model1.parameters(), lr=0.01)\n",
    "avg_loss1, time1 = train(loader1, model1, optimizer1, device, non_blocking=False, epochs=5)\n",
    "print(f\"设置 1 - 平均损失: {avg_loss1:.4f}, 训练时间: {time1:.2f} 秒\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "f5b45b94-1923-4c7c-9193-114ec2ddaf5a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-28T04:04:45.048577Z",
     "iopub.status.busy": "2024-09-28T04:04:45.048317Z",
     "iopub.status.idle": "2024-09-28T04:05:04.980463Z",
     "shell.execute_reply": "2024-09-28T04:05:04.979050Z",
     "shell.execute_reply.started": "2024-09-28T04:04:45.048561Z"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "df16ac725fcd4680b6dd05b2383d382e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epochs:   0%|          | 0/5 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch 1/5:   0%|          | 0/938 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch 2/5:   0%|          | 0/938 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch 3/5:   0%|          | 0/938 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch 4/5:   0%|          | 0/938 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch 5/5:   0%|          | 0/938 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "设置 2 - 平均损失: 0.3847, 训练时间: 19.92 秒\n"
     ]
    }
   ],
   "source": [
    "model2 = SimpleNet().to(device)\n",
    "optimizer2 = optim.SGD(model2.parameters(), lr=0.01)\n",
    "avg_loss2, time2 = train(loader2, model2, optimizer2, device, non_blocking=True, epochs=5)\n",
    "print(f\"设置 2 - 平均损失: {avg_loss2:.4f}, 训练时间: {time2:.2f} 秒\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f8fca14-1133-482c-8405-ae5138245bd9",
   "metadata": {},
   "source": [
    "#### transformers Trainer"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6b3d950-bf7d-4279-a57c-f1d47c55ba36",
   "metadata": {},
   "source": [
    "- When using Trainer, the corresponding TrainingArguments are:\n",
    "    - `dataloader_pin_memory` (True by default),\n",
    "    - `dataloader_num_workers` (defaults to 0).\n",
    "- non_blocking\n",
    "    - 从 `trainer.py` 源码中来看似乎是从 accelerate_config 中设置的；??"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90632d8b-42d5-4281-b163-63b790dc71b8",
   "metadata": {},
   "source": [
    "### JIT（Just-In-Time compilation) ）"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a42c205-ad0e-4556-9a46-2c0f159964c7",
   "metadata": {},
   "source": [
    "- JIT 通过将模型编译成中间表示（Intermediate Representation, IR），然后进一步将其转换为机器代码\n",
    "- Fuse the pointwise (elementwise) operations into a single kernel by PyTorch JIT\n",
    "    - JIT fuse the pointwise operations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "7af19e9d-9e20-4810-83ae-b369081f8765",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "20.05574530499871 31.39065190600013\n"
     ]
    }
   ],
   "source": [
    "# 创建一个大型的随机张量作为输入数据\n",
    "x = torch.randn(15000, 15000)\n",
    "\n",
    "# 使用 JIT 编译的函数\n",
    "@torch.jit.script\n",
    "def fused_gelu(x):\n",
    "    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))\n",
    "\n",
    "# 未使用 JIT 编译的相同函数\n",
    "def gelu(x):\n",
    "    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))\n",
    "\n",
    "# 使用 timeit 测量 JIT 编译函数的执行时间\n",
    "jit_time = timeit.timeit('fused_gelu(x)', globals=globals(), number=100)\n",
    "nonjit_time = timeit.timeit('gelu(x)', globals=globals(), number=100)\n",
    "\n",
    "print(jit_time, nonjit_time)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d261c33-5bf7-4f1e-959c-8052b6e1ceb5",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
